READ-BAD: A New Dataset and Evaluation Scheme for Baseline Detection in Archival Documents
Abstract
Text line detection is crucial for any application associated with Automatic Text Recognition or Keyword Spotting. Modern algorithms perform well on well-established datasets because these datasets comprise either clean data or simple, homogeneous page layouts. We have collected and annotated 2036 archival document images from different locations and time periods. The dataset contains varying page layouts and degradations that challenge text line segmentation methods. Well-established text line segmentation evaluation schemes such as the Detection Rate or Recognition Accuracy require binarized data that is annotated at the pixel level. Producing ground truth by these means is laborious and is not needed to determine a method's quality. In this paper we propose a new evaluation scheme that is based on baselines. The proposed scheme needs no binarization, can handle skewed and rotated text lines, and yields results that correlate with Handwritten Text Recognition accuracy. The ICDAR 2017 Competition on Baseline Detection and the ICDAR 2017 Competition on Layout Analysis for Challenging Medieval Manuscripts make use of this evaluation scheme.
Similar resources
Protection of Archival Documents from Photochemical Effects
Purpose: The purpose of this paper is to highlight the destructive effects of light on archival documents/paper materials. The research aims to explain the mechanism of photochemical degradation and the damaging effect of light on paper. It also tells us about the measures to be adopted to control the deteriorating effects of light on paper step by step. Design/Methodology/Approach: The res...
Fact-Checking of Reports in Kalamat-e Anjoman Using Archival Documents on the History of Kashan during the Qajar Era
This research aims to do a content review on the materials printed and published in “Kalamat-e Anjoman” about the history of Kashan. This study also assesses these materials using archival documents in order to confirm or refute the contents. This research used a descriptive/analytical method and the data were obtained from “Kalamat-e Anjoman” and archival documents. Findings show that Abdolras...
Assessment Methodology for Anomaly-Based Intrusion Detection in Cloud Computing
Cloud computing has become an attractive target for attackers as the mainstream technologies in the cloud, such as the virtualization and multitenancy, permit multiple users to utilize the same physical resource, thereby posing the so-called problem of internal facing security. Moreover, the traditional network-based intrusion detection systems (IDSs) are ineffective to be deployed in the cloud...
Effective Learning to Rank Persian Web Content
The Persian language is one of the most widely used languages in the Web environment. Hence, the Persian Web includes invaluable information that is required to be retrieved effectively. Similar to other languages, ranking algorithms for Persian Web content deal with different challenges, such as applicability issues in real-world situations as well as the lack of user modeling. CF-Rank, as a ...
The Reading Crisis in Iran (During the 1960s and 1970s): A Critical Discourse Analysis
Purpose: Reading is one of the challenging problems in contemporary Iran. After the Persian Constitutional Revolution (1905-1911), reading became one of the factors that Iranians considered necessary for modernization and development. For this reason, most people, even those who were literate, had no desire to read. This situation was unpleasant for intellectuals, publishers and cultural activist...
Journal title:
- CoRR
Volume abs/1705.03311
Publication date 2017